
Course homepage: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/

Video: https://www.bilibili.com/video/av46216519?from=search&seid=13229282510647565239

This post reviews CS224N Assignment 2; Assignment 1 is fairly basic and is omitted here.

1. Understanding word2vec

(a)

Note that $y_w = 1$ only when $w = o$ and $y_w = 0$ otherwise, so

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o)$$
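
As a quick sanity check, here is a small numpy sketch of mine (made-up numbers, not assignment code) showing that the cross entropy against a one-hot $y$ collapses to the single term $-\log(\hat{y}_o)$:

```python
import numpy as np

y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # some predicted distribution
o = 2                                    # index of the true outside word
y = np.zeros_like(y_hat); y[o] = 1.0     # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))
assert np.isclose(cross_entropy, -np.log(y_hat[o]))
```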

(b)

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) = -\log \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} = -u_o^\top v_c + \log \sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)$$

Therefore,

$$\begin{aligned}
\frac{\partial J_{\text{naive-softmax}}(v_c, o, U)}{\partial v_c}
&= -u_o + \sum_{w \in \text{Vocab}} \frac{\exp(u_w^\top v_c)}{\sum_{w' \in \text{Vocab}} \exp(u_{w'}^\top v_c)} \frac{\partial (u_w^\top v_c)}{\partial v_c} \\
&= -u_o + \sum_{w \in \text{Vocab}} P(O = w \mid C = c)\, u_w \\
&= -U y + U \hat{y} = U(\hat{y} - y)
\end{aligned}$$

(c)

$$\begin{aligned}
\frac{\partial J_{\text{naive-softmax}}(v_c, o, U)}{\partial u_w}
&= -v_c 1\{w = o\} + \frac{\exp(u_w^\top v_c)}{\sum_{w' \in \text{Vocab}} \exp(u_{w'}^\top v_c)} \frac{\partial (u_w^\top v_c)}{\partial u_w} \\
&= -v_c 1\{w = o\} + P(O = w \mid C = c)\, v_c \\
&= v_c (\hat{y}_w - y_w)
\end{aligned}$$
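
The closed forms in (b) and (c) are easy to confirm numerically. Below is a finite-difference check I added as an illustration; the data is random and `U` stores the outside vectors $u_w$ as columns, matching the convention of this problem (the helper names are my own):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax for a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def loss(vc, o, U):
    # J_naive-softmax(v_c, o, U); columns of U are the outside vectors u_w
    y_hat = softmax(U.T @ vc)
    return -np.log(y_hat[o])

np.random.seed(0)
d, V, o, eps = 4, 7, 3, 1e-6
U = np.random.randn(d, V)
vc = np.random.randn(d)

y_hat = softmax(U.T @ vc)
y = np.zeros(V); y[o] = 1.0

# analytic gradients from (b) and (c)
grad_vc = U @ (y_hat - y)
grad_U = np.outer(vc, y_hat - y)   # column w equals v_c * (y_hat_w - y_w)

# central finite differences
num_vc = np.array([(loss(vc + eps * np.eye(d)[i], o, U) -
                    loss(vc - eps * np.eye(d)[i], o, U)) / (2 * eps)
                   for i in range(d)])
num_U = np.zeros_like(U)
for i in range(d):
    for j in range(V):
        E = np.zeros_like(U); E[i, j] = eps
        num_U[i, j] = (loss(vc, o, U + E) - loss(vc, o, U - E)) / (2 * eps)

assert np.allclose(grad_vc, num_vc, atol=1e-5)
assert np.allclose(grad_U, num_U, atol=1e-5)
```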

(d)

Computing the Jacobian gives

$$\frac{\partial \sigma(x_i)}{\partial x_j} = \sigma(x_i)\big(1 - \sigma(x_i)\big)\, 1\{i = j\}$$

Therefore,

$$\frac{\partial \sigma(x)}{\partial x} = \operatorname{diag}\big(\sigma(x) \odot (1 - \sigma(x))\big)$$
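
A quick numerical confirmation of the diagonal Jacobian (again just an illustrative sketch, not assignment code):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, -1.0, 2.0])
n, eps = x.size, 1e-6

# numerical Jacobian J[i, j] = d sigma(x_i) / d x_j
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n); e[j] = eps
    J[:, j] = (sigmoid(x + e) - sigmoid(x - e)) / (2 * eps)

s = sigmoid(x)
assert np.allclose(J, np.diag(s * (1 - s)), atol=1e-6)
```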

(e)

Note that $o \notin \{1, \ldots, K\}$, i.e. the sampled negative words are distinct from the true outside word. With the negative sampling loss

$$J_{\text{neg-sample}}(v_c, o, U) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c),$$

the chain rule together with (d) gives

$$\begin{aligned}
\frac{\partial J_{\text{neg-sample}}(v_c, o, U)}{\partial v_c}
&= -\frac{\sigma(u_o^\top v_c)\big(1 - \sigma(u_o^\top v_c)\big)}{\sigma(u_o^\top v_c)} u_o - \sum_{k=1}^{K} \frac{\sigma(-u_k^\top v_c)\big(1 - \sigma(-u_k^\top v_c)\big)}{\sigma(-u_k^\top v_c)} (-u_k) \\
&= -\big(1 - \sigma(u_o^\top v_c)\big) u_o + \sum_{k=1}^{K} \big(1 - \sigma(-u_k^\top v_c)\big) u_k \\
\frac{\partial J_{\text{neg-sample}}(v_c, o, U)}{\partial u_o}
&= -\frac{\sigma(u_o^\top v_c)\big(1 - \sigma(u_o^\top v_c)\big)}{\sigma(u_o^\top v_c)} v_c = -\big(1 - \sigma(u_o^\top v_c)\big) v_c \\
\frac{\partial J_{\text{neg-sample}}(v_c, o, U)}{\partial u_k}
&= -\frac{\sigma(-u_k^\top v_c)\big(1 - \sigma(-u_k^\top v_c)\big)}{\sigma(-u_k^\top v_c)} (-v_c) = \big(1 - \sigma(-u_k^\top v_c)\big) v_c
\end{aligned}$$

The naive softmax loss requires summing exponentials over the whole vocabulary, which can easily overflow and has to be handled with care; the negative sampling loss has no such normalizing sum, so it does not suffer from this problem.
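
As a made-up illustration of the overflow issue, with large scores `np.exp` produces `inf` and the naive softmax returns `nan`, while the standard fix of subtracting the maximum score first (which is what a numerically stable softmax implementation does) stays finite:

```python
import numpy as np

scores = np.array([1000.0, 1001.0, 1002.0])

# naive softmax: exp overflows to inf, and inf / inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(scores) / np.sum(np.exp(scores))
print(naive)    # [nan nan nan]

# stable version: subtract the max before exponentiating
shifted = scores - np.max(scores)
stable = np.exp(shifted) / np.sum(np.exp(shifted))
print(stable)   # [0.09003057 0.24472847 0.66524096]
```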

(f)

$$\begin{aligned}
\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)}{\partial U}
&= \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U} \\
\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)}{\partial v_c}
&= \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c} \\
\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U)}{\partial v_w}
&= \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_w} = 0 \qquad (w \ne c)
\end{aligned}$$

2. Implementing word2vec

(a)

sigmoid

```python
import numpy as np

def sigmoid(x):
    """
    Compute the sigmoid function for the input here.
    Arguments:
    x -- A scalar or numpy array.
    Return:
    s -- sigmoid(x)
    """

    ### YOUR CODE HERE
    s = 1 / (1 + np.exp(-x))

    ### END YOUR CODE

    return s
```

naiveSoftmaxLossAndGradient

Note that the matrices here are the transposes of those in Problem 1: `outsideVectors` stores each outside vector $u_w$ as a row (shape n × d), so the formulas above appear in transposed form.

```python
### YOUR CODE HERE

### Please use the provided softmax function (imported earlier in this file)
### This numerically stable implementation helps you avoid issues pertaining
### to integer overflow. 
# centerWordVec:  v_c, shape (d,)   (written as 1 * d below)
# outsideVectors: shape (n, d), row w is the outside vector u_w
#                 (the transpose of U from Problem 1)
# scores u_w^T v_c, 1 * n
vec = centerWordVec.dot(outsideVectors.T)
# y_hat, 1 * n
prob = softmax(vec)
loss = -np.log(prob[outsideWordIdx])
# dJ/dv_c = U(y_hat - y) written with rows, 1 * d
gradCenterVec = -outsideVectors[outsideWordIdx] + prob.dot(outsideVectors)
# dJ/dU in row form: (y_hat - y) v_c^T, n * d
gradOutsideVecs = prob.reshape(-1, 1).dot(centerWordVec.reshape(1, -1))
gradOutsideVecs[outsideWordIdx] -= centerWordVec
### END YOUR CODE
```

negSamplingLossAndGradient

```python
### YOUR CODE HERE

### Please use your implementation of sigmoid in here.
# centerWordVec:  v_c, shape (d,)   (written as 1 * d below)
# outsideVectors: shape (n, d), row w is the outside vector u_w
# indices = [outsideWordIdx] + negSampleWordIndices is built by the starter code,
# so entry 0 corresponds to the true outside word and entries 1..K to negatives
# scores, 1 * (K + 1): entry 0 is u_o^T v_c, entries 1..K are u_k^T v_c
vec = centerWordVec.dot(outsideVectors[indices].T)
# negate the negative-sample scores so that sigmoid gives sigma(-u_k^T v_c)
vec[1:] *= -1
sig = sigmoid(vec)
tmp = np.log(sig)
loss = -tmp[0] - np.sum(tmp[1:])
# t1[0] = 1 - sigma(u_o^T v_c), t1[k] = 1 - sigma(-u_k^T v_c), 1 * (K + 1)
t1 = 1 - sig
# the u_o term enters with a minus sign, hence the -2 * t1[0] correction
gradCenterVec = t1.dot(outsideVectors[indices]) - 2 * t1[0] * outsideVectors[outsideWordIdx]
# accumulate: the same negative word may be sampled more than once
gradOutsideVecs = np.zeros_like(outsideVectors)
gradOutsideVecs[outsideWordIdx] += -t1[0] * centerWordVec
for i in range(K):
    k = negSampleWordIndices[i]
    gradOutsideVecs[k] += t1[i + 1] * centerWordVec
### END YOUR CODE
```
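
The assignment also asks for the `skipgram` function, which is not shown here; as derived in 1(f), it simply sums the per-outside-word losses and gradients over the window. Below is a rough sketch of that idea with a simplified signature of my own; the real starter code passes extra arguments (e.g. the word-to-index mapping and the dataset), so this only conveys the structure:

```python
import numpy as np

def skipgram_sketch(centerWordIdx, outsideWordIndices,
                    centerWordVectors, outsideVectors, lossAndGradient):
    """Sum the loss and gradients over every outside word in the window.

    lossAndGradient(v_c, outsideWordIdx, outsideVectors) should return
    (loss, gradCenterVec, gradOutsideVecs), playing the role of
    naiveSoftmaxLossAndGradient / negSamplingLossAndGradient (whose real
    signatures also take additional arguments such as the dataset).
    """
    loss = 0.0
    gradCenterVecs = np.zeros_like(centerWordVectors)
    gradOutsideVecs = np.zeros_like(outsideVectors)

    vc = centerWordVectors[centerWordIdx]
    for o in outsideWordIndices:
        l, gradVc, gradU = lossAndGradient(vc, o, outsideVectors)
        loss += l
        gradCenterVecs[centerWordIdx] += gradVc  # only v_c receives a gradient
        gradOutsideVecs += gradU
    return loss, gradCenterVecs, gradOutsideVecs
```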

(b)

```python
### YOUR CODE HERE
loss, grad = f(x)
x -= step * grad

### END YOUR CODE
```
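
This update runs inside the provided `sgd` training loop. As a standalone toy example of mine (not assignment code), the same two lines minimize a simple quadratic:

```python
import numpy as np

def f(x):
    """Toy objective f(x) = ||x - 3||^2 with gradient 2(x - 3)."""
    loss = np.sum((x - 3.0) ** 2)
    grad = 2.0 * (x - 3.0)
    return loss, grad

x = np.zeros(5)
step = 0.1
for _ in range(200):
    loss, grad = f(x)
    x -= step * grad      # the same update as above

print(loss)   # close to 0
print(x)      # close to [3. 3. 3. 3. 3.]
```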